July 6, 2019

Preview

  1. Introduction to single-cell RNA-seq
  2. Quality control and normalization
  3. Survey of downstream analysis methodology

Getting that count matrix

  • standard QC on raw reads (adapter/quality trim)
  • mapping to reference genome - bulk tools appropriate (e.g. Salmon, STAR, bowtie, etc)
  • count reads per transcript/gene - bulk tools widely used (e.g. Subread, RSEM), specialized tools available (e.g. Alevin for droplet data)
  • collapse UMIs & demultiplex (if applicable) -> scPipe

R: SingleCellExperiment Class

SingleCellExperiment Example

## class: SingleCellExperiment 
## dim: 23918 192 
## metadata(0):
## assays(1): counts
## rownames(23918): ENSMUSG00000103377 ENSMUSG00000103147 ...
##   ERCC-00171 CBFB-MYH11-mcherry
## rowData names(13): GeneLength ENSEMBL ... total_counts
##   log10_total_counts
## colnames(192): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1
##   SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ...
##   SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1
##   SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1
## colData names(48): Plate Oncogene ...
##   pct_counts_in_top_500_features_ERCC PlateOnco
## reducedDimNames(0):
## spikeNames(1): ERCC

This SingleCellExperiment object has 23918 genes and 192 cells.

Quality control metrics

  • library size (= total number of reads/UMIs)
  • number of expressed features / detection rate / dropout
  • expression levels by gene
  • proportion of reads mapped to spike-ins (if available) or MT genes
  • [Droplet] remove empty droplets
  • [Droplet] remove doublets

Library size

Number of expressed features

Detection rate

Expression levels by gene - highest

Expression levels by gene - average

Proportion of reads mapped to endogenous transcripts

[Droplet] Remove empty droplets

  • Ideally each droplet contains one cell and one bead
  • Process of adding cells (and beads) to droplets is stochastic, approximated by a Poisson process: \[ P(k \text{ cells in a droplet}) = \frac{e^{-\lambda}\lambda^k}{k!}\]
  • Typically cells are added at a low rate (\(\lambda\) << 1) to reduce the chance that a droplet contains multiple cells
  • Unfortunately, this means many droplets contain zero cells
  • Also unfortunately, reads still map to droplets with no cells

Simple idea: empty droplets have fewest reads

More sophisticated idea: EmptyDrops

  • Lun et al. 2019, (https://doi.org/10.1186/s13059-019-1662-y)
  • Key idea: expression profile of empty droplets \(\sim\) ambient RNA - Estimate ambient RNA profile by pooling across all cells/barcodes
  • Ambient expression \(A_g\) for gene \(g\) is sum of counts \(y_{bg}\) over all barcodes \(b\): \[ A_g = \sum_b y_{bg} \]
  • Counts for barcode \(b\) conditional on total counts for barcode \(b\) ~ Dirichlet-multinomial
    • allows for overdispersion (analogous to Gamma-poisson)
  • Barcodes with strong deviations (permutation test on likelihood) are determined to be cells

EmptyDrops rescues cells with few reads

[Droplet] Remove doublets

  • Even with the low cell rate, occasionally droplets contain two or more cells
  • Can also happen in plate-based protocols (though rare) due to improper cell sorting/capture
  • Simple idea: remove barcodes with especially high numbers of reads - works poorly with diverse cell types

More sophisticated idea: cluster mixtures

Another sophisticated idea: simulate doublets

Normalization

  • Now that we’ve removed problematic cells, need to put counts on a comparable scale before/during downstream analysis
  • Main factors to account for:
    • library size / sequencing depth
    • batch effects (known or unknown)

Review: Sequence more, obtain more reads

  • DESeq2 library size factors use the median ratio method - Anders and Huber 2010 (https://doi.org/10.1186/gb-2010-11-10-r106): \[ {s}_b = \text{median}_g \frac{y_{bg}}{(\prod_{j=1}^B y_{gj})^{1/B}} \]
  • assumes majority of genes are not differentially expressed
  • adjusted counts \(= \frac{y_{bg}}{s_b}\)

Median ratio method in single-cell

  • Consider this quantity in single-cell: \[ {s}_b = \text{median}_g \color{red}{\frac{y_{bg}}{(\prod_{j=1}^B y_{gj})^{1/B}}} \]
  • For every gene \(g\) where \(y_{bg} = 0\) for any \(b\), the highlighted quantity will be undefined
  • This means that the quantity is only computed for genes that are nonzero in every cell
  • As a result, CPM (counts per million) normalization commonly used for single-cell: \[ s_b = \frac{\sum_b y_{bg}}{10^6} \]

Pooling & deconvolution normalization

  • scran: Lun et al. 2016 (https://doi.org/10.1186/s13059-016-0947-7) proposed a pooling & deconvolution approach

  • Cells are pooled together and normalized against a global pseudoreference, then deconvolved by solving a system of linear equations

Quantile regression normalization

  • SCnorm: Bacher et al. 2017 (https://doi.org/10.1038/nmeth.4263)
  • Key idea: groups of genes get separate size factors based on observed count-depth relationship
  • Ignores zeroes (developed for plate-based read count data)

SCnorm reduces spurious discoveries

Related idea: NB GLM residuals normalization

scTransform reduces correlation with depth

Batch effects (equal cell composition)

  • When batches are known and cell populations are identical across batches \(z_b\) (e.g. evenly split across two plates & homogeneous), can estimate effects due to batch

  • limma: Ritchie et al. 2015 (https://doi/org/10.1093/nar/gkv007)
    • fit a linear model to log-normalized counts to estimate batch effects \(\beta_2\) and effects of interest \(\beta_1\): \[ log(y_{bg}) = \beta_0 + \beta_1 x_b + \beta_2 z_{b} \]
    • remove batch effect: \[ log(y_{bg}) - \beta_2 z_{b} \]
  • see also Combat: Johnson & Rabinovic 2007 (https://doi.org/10.1093/biostatistics/kxj037), implemented in sva

Batch effects (unknown cell compositions)

MNNcorrect can align heterogeneous batches

Variance stabilizing transformations

  • Review: a VST removes the dependence of the variance on the mean
  • It’s useful to a apply variance stabilizing transformation before:
    • visualization
    • selecting highly variable genes
    • using downstream analysis tools that don’t explicitly model count data (e.g. MAST, scDD)
  • Examples:
    • log (ideal when relative differences are important rather than absolute; requires addition of a pseudocount)
    • DESeq2::vst (estimates and removes trend of mean and variance)
  • Not necessary if data is already log CPM normalized

Example: log and DESeq2::vst

Should we impute zeroes?

Imputation can have unintended consequences

R tools for quality control and normalization

  • scater: visualization, quality control (Bioconductor)
  • scran: normalization, doublet detection, batch effect correction (Bioconductor)
  • SCnorm: normalization (Bioconductor)
  • sctransform: normalization (CRAN)
  • DropletUtils: removal of empty droplets (Bioconductor)
  • Seurat: normalization (CRAN)

drawingdrawingdrawing

Install packages we’ll use in the first set of exercises

# workflow that includes many of the packages we'll explore
BiocManager::install("simpleSingleCell")
BiocManager::install("simpleSingleCell", dependencies = "Suggests")

# individual package not included in the above
BiocManager::install("SCnorm")